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Abstract 

We consider the problem of sparse matrix multiplication by 
the column row method in a distributed setting where the 
matrix product is not necessarily sparse. We present a sur- 
prisingly simple method for "consistent" parallel process- 
ing of sparse outer products (column-row vector products) 
over several processors, in a communication- avoiding setting 
where each processor has a copy of the input. The method 
is consistent in the sense that a given output entry is al- 
ways assigned to the same processor independently of the 
specific structure of the outer product. We show guaran- 
tees on the work done by each processor, and achieve linear 
speedup down to the point where the cost is dominated by 
reading the input. Our method gives a way of distributing 
(or parallelizing) matrix product computations in settings 
where the main bottlenecks are storing the result matrix, 
and inter-processor communication. Motivated by observa- 
tions on real data that often the absolute values of the entries 
in the product adhere to a power law, we combine our ap- 
proach with frequent items mining algorithms and show how 
to obtain a tight approximation of the weight of the heaviest 
entries in the product matrix. 

As a case study we present the application of our approach to 
frequent pair mining in transactional data streams, a prob- 
lem that can be phrased in terms of sparse {0, l}-integer 
matrix multiplication by the column-row method. Experi- 
mental evaluation of the proposed method on real-life data 
supports the theoretical findings. 

1 Introduction 

Column row vector products (aka. outer products) are 
ubiquitous in scientific computing. Often one needs to 
compute an aggregate function over several products. 
Most notably, the product of two matrices can be writ- 
ten as a sum of outer products. Observe that a prod- 
uct of two sparse matrices can be dense in the worst 
case, and will typically be much less sparse than the in- 
put matrices. This is a practical problem in algorithms 
that multiply very high-dimensional, sparse matrices, 
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such as the Markov Cluster Algorithm f5T| popular in 
bio-informatics. There are approaches to approximat- 
ing matrix products that can use less space [H [19] but 
they do not address how to balance the work in a par- 
allel or distributed setting. State-of-the-art clusters for 
bio-informatics can easily have a combined main mem- 
ory capacity of 1 TB or more. Thus there is potential 
for computing huge result matrices if computation and 
storage can be used efficiently 

Based on the observation that handling the output 
data, rather than the input data, can be the main bot- 
tleneck we present a method for parallelizing the com- 
putation of sparse outer products in a setting where the 
input is assumed to he broadcast. No other communica- 
tion is allowed — that is, algorithms need to be com- 
munication avoiding. To our best knowledge there is no 
previous work on (sparse) matrix multiplication in such 
a setting. The standard approach is to assume that this 
input is distributed among processors, which then com- 
municate as needed. As noted by Demmel et al. [H] 
communication is often a bottleneck in parallel sparse 
matrix computations. We believe that in some settings 
(such as an ethernet-connected cluster) where broad- 
cast is cheap, but the total capacity for point-to-point 
communication is a bottleneck, an approach that trades 
broadcasting the input for reduced inter-processor com- 
munication can improve performance, assuming that the 
total work remains the same and that the computation 
is load-balanced well. 

Because there can be no communication after the 
broadcast of the input we do not need to consider a par- 
ticular parallel or distributed memory model. However, 
for concreteness we will from now on speak of paral- 
lelization across multiple processors. For practical rea- 
sons we also restrict our experiments to the setting of a 
multi-core architecture. Our algorithm avoids commu- 
nication by distributing the matrix output entries in a 
consistent way, i.e., each processor will process either 
all nonzero terms contributing to a given output entry, 
or none of them. As a case study we consider 0-1 ma- 
trix products arising in data mining applications, where 
the task is to approximate the largest entries in the re- 
sult matrix well. This is done by combining our method 
for parallelizing outer products with known heavy hitter 
algorithms. 



Our algorithm works in a data stream setting, where 
the vectors are given one at a time, and space for sav- 
ing past vectors is not available. In contrast, traditional 
work on parallel matrix multiplication requires that the 
whole input can be stored in the memory of the sys- 
tem. However, our main point is not the space saved 
on not storing the input, but rather that the (possi- 
bly non-dense) output is distributed evenly among the 
processors. We believe that there are many interest- 
ing opportunities ahead in parallel processing of data 
streams, especially for computations that require more 
than (quasi-) linear time in the input size. 

2 Background 

Parallel matrix multiplication. A major algo- 
rithmic problem that can be approached using our 
method is sparse matrix multiplication. The product 
of matrices A,Be M"^" can be computed by summing 
over n outer products of columns of A and rows of B. 
Assuming all outer products are dense a simple algo- 
rithm distributes the computation among several pro- 
cessors such that the work load is equal and there is 
no need for communication between the processors or 
shared data structures: each one of p processors will sum 
over a submatrix of the outer products of size ^ ^ j 
and only the column and row vectors have to be read by 
each processor. This approach, first described by Can- 
non [7], is known to achieve very good scalability with 
a growing number of processors. 

However, for the situation where the outer products 
are expected to be sparse, this simple algorithm is not 
guaranteed to achieve good scalability. The reason is 
that we do not know in advance the specific structure 
of the output matrix and it might happen that certain 
^ X ^-submatrices are dense while many others are 
sparse, which implies that the workload is not balanced 
among processors. An approach to avoid this problem 
is to initially permute the rows (resp. columns) of 
the left (resp. right) input matrix. This corresponds 
to permuting both rows and columns of the output 
matrix. However, this approach is vulnerable to the 
situation in which the nonzero output entries are not 
well distributed among rows or columns. For example, 
if half of the work in computing the output relates to 
a single output row or column, this work will only be 
distributed among ^/p out of p processors. 

Also, it is often the case that we don't know in 
advance the exact dimensions of the sparse matrices 
as they are generated in a streaming fashion. Matrix 
multiplication in the streaming setting has recently 
received some attention [12 [13 [2D]. Randomized 
algorithms have been designed approximating the values 



of individual entries in the matrix product running in 
subcubic time and subquadratic space in one or two 
passes over the input matrices and requiring access only 
to certain columns and rows. 

We refer to [3] for an overview on state-of-the-art 
results on parallel matrix multiplication relying on inter- 
processor communication. 

3 Overview of contributions. 

Parallel sparse matrix multiplication. Hash- 
ing has long been used for load balancing tasks, mapping 
a given task x to a random bucket h{x). A good hash 
function will distribute the tasks almost evenly among 
the buckets, such that different processors will handle 
given buckets and get a similar load. When comput- 
ing a matrix product we can think of each entry of the 
output matrix as a task. However, if we decide to sim- 
ply hash all entries in a product (in parallel), a lot of 
inter-processor communication will be needed to iden- 
tify the entries in each bucket that are nonzero in a 
given outer product. Our approach turns this around, 
and uses a carefully chosen hash function that allows a 
processor to efficiently identify the output entries hash- 
ing to a given bucket. In particular each processor can, 
without communication or shared data structures, pro- 
cess exactly the terms that belong to its bucket, with an 
additional overhead that is linear in the size of the in- 
put vectors. We show that our approach is particularly 
well suited for matrix multiplication by the column-row 
method when individual outer products are sparse but 
not the resulting matrix product. 

Approximate matrix multiplication. As al- 
ready mentioned, randomized algorithms running in 
subcubic time, using a small amount of memory and 
performing only a limited number of scans of the input 
matrix are gaining more and more popularity [121 119|, 
I20| . Instead of computing exactly the matrix product, 
the algorithms return an approximation of individual 
entries. For matrix products where the entries adhere to 
a skewed distribution, the approximation of the heaviest 
entries is known to be of very high quality. The reader is 
referred to [19] for a more detailed discussion and a list 
of applications of approximate matrix multiplication. 

Inspired by the above and observations on real 
data, we compose our parallel matrix multiplication 
method with two known algorithms for finding frequent 
items in data streams: Space-Saving [18] and Count- 
Sketch [5]. The former gives deterministic upper 
and lower bounds on the true value and the latter - 
an unbiased estimator. Both algorithms are capable 
of handling weighted updates but Space-Saving is 
restricted only to positive updates. 



More concretely, our main contributions can be summa- 
rized as follows: 

• A new algorithm for the parallelization of the 
multiplication of sparse matrices by the column row 
method. 

• The combination of the above with approximate 
matrix multiplication. 

• We theoretically analyze the expected time com- 
plexity, the load balancing among cores and the ap- 
proximation guarantee of our algorithm under the 
assumption of Zipfian distribution of the entries' 
weights (a common situation in many real-life ap- 
phcations). In particular, while Count-Sketch 
was always based on an initial hashing/splitting 
step, we believe that this work is the first to in- 
vestigate the composition of hashing and Space- 
Saving for approximate matrix multiplication. 

• Extensive experimental evaluation of our approach 
in the context of frequent pair mining in transac- 
tional data streams. 

4 Preliminaries 

Notation For vectors u,v E M" the outer product 
of u and v is denoted as uv E M"^". For matrices 
A,B E M"^" we denote by the ith column of 
A and by bj the jth row of B for i,j E [n] where 
[n] :— {0, . . . , n — 1}. The ith element in a vector u 
is written as u{i). The weight of the entry {i,j) in 
the product AB is the inner product of the zth row of 
A and jth column of B, for i, j E [n]. Alternatively, 
in the column-row notation it can be also written as 
X]fe=o i-^-' the sum over the (?,j)th entry 
in all outer products. When clear from the context we 
refer to entries with larger absolute weight as heavy 
entries. The number of non-zero entries in m €: M" is 
written as 

A family % of functions from £ to [k] is k-wise 
independent if for a function h : £ ^ [k] chosen 
uniformly at random from "H it holds 

Pr[/i(ei) = ci A h{e2) = C2 A • • • A h{et) = q] = fc"* 

for distinct elements ei E £ , 1 < i < t, and c; E [k]. We 
will refer to a function chosen uniformly at random from 
a /c-wise independent family H as a fc-wise independent 
function. 

The elements in £ adhere to Zipfian distribution 
with parameters C and z ii Wi — C/i^ for the absolute 
weight Wi of the ith most heavy element in £. 



Frequent items mining algorithms. We will 
use two well-known frequent item mining algorithms 
Count-Sketch [S] and Space-Saving [TS] as subrou- 
tines. We give a brief overview of how they work. 

In Count-Sketch every item i is hashed by a 
hash function /i : [n] — >■ [k] to a position in an array 
CS consisting of k estimators, each of them being a 
real number. Upon updating the weight of an item 
i, we add 1 or —1 to the corresponding estimator 
by using a uniform sign hash function s{i) evaluating 
i to either 1 or —1. After processing the stream 
the frequency of a given item i can be estimated as 
CS[h{i)] ■ s{i). The intuition is that for a heavy item the 
contribution from other items will cancel out and will 
not significantly affect the estimate. Both h and s are 
pairwise independent and this is sufficient to show that 
for an appropriate number of estimators and a skewed 
distribution of item weights the heavy items will be 
assigned to only one estimator with high probability. 
The probability for correct estimates can be amplified 
by working with t > 1 hash functions and returning the 
median of the t estimates upon a query for the frequency 
of a given item. 

The Space-Saving algorithm offers upper and 
lower frequency bounds, rather than an unbiased esti- 
mator. It keeps a summary of the stream consisting of £ 
triples (item^ , county , overestimatiouj ) , 1 < j < £■ The 
£ triples are sorted according to their count value. Upon 
the arrival of a new item i the algorithm distinguishes 
between the following cases: 

• If not all £ slots are already full, we insert a new 
triple as (i, 1, 0). 

• If i is already recorded, we increase the correspond- 
ing counter by 1 and update the order in the sum- 
mary. 

• Otherwise, we replace the last triple 
(itemf, county, overestimation^) with a new 
triple (i, county + 1, county). 

After processing the stream we return as an estimate for 
an item weight either its counter in the summary or, if 
not recorded, the overestimation in the last counter in 
the list. For a stream of length m, the overestimation in 
a given counter is bounded by m/£ and it is guaranteed 
that an item occurring more than m/£ times will be 
in the summary. This comes from the fact that each 
increase in the overestimation of an item not recorded 
in the summary is witnessed by £ different items in the 
summary and this cannot happen more than m/£ times. 
However, the algorithm is known to perform extremely 
well in practice and report almost the exact weights for 
the heaviest entries, see e.g. [TU] . 



A natural generalization of both algorithms works 
with weighted positive updates. This is straightforward 
for Count-Sketch, and for Space-Saving the only 
issue to consider is how to efBciently update the sum- 
mary. A solution achieving amortized constant time per 
update is presented in 0. 

5 Our approach 

The algorithm. Assume we are given matrices 
A,Be M"^" such that A is stored as column-major 
ordered triples and B as row-major ordered triples. (For 
an overview of efficient implementations of the model we 
refer to Chapter 2 of [4-) The skeleton of our algorithm 
is the following: 

• Define s Count-Sketch estimators (or alterna- 
tively Space-Saving summaries with £ triples, for 
a small constant £). 

• For the feth column and row vectors € A,bk € B, 
k € [n], hash each entry in ab to one of the 
estimators. 

• After all outer products have been processed return 
an estimate for all entries. 

In the parallel version, each processor keeps track of 
a subset of the estimators, and the total space remains 
fixed. Thus, we are in a shared nothing model with no 
need for a shared memory - the only requirement is that 
each processor sees the column and row vectors for each 
outer product. Pseudocode of the algorithm is given in 
Figure [T] We refer to it as the CRoP algorithm, which 
refers to the fact that each processor crops the output 
matrix to produce a small fraction of it, and is also an 
acronym for "Column-Row Parallel" . 

A crucial property in our analysis and experimental 
evaluations is that the heaviest entries do not often 
collide, and thus we obtain high quality estimates 
on their weight. We combine two different ways for 
estimating the weight of the heaviest entries based on 
the Count-Sketch and Space-Saving algorithms. In 
particular, we use a distribution hash function h : 
[n] X [n] [k] to split the set of entries into k parts, 
and use a Space-Saving sketch and a Count-Sketch 
estimator on each part. The size n of the hash table 
and the size of the Space-Saving sketch determines 
the accuracy of the estimates. 

Parallel processing of entries. Naively we could 
just iterate through all entries of each outer product, 
but we would like an algorithm that runs in time linear 
in the number of nonzero elements in the input vectors 
and the entries hashing to a given interval of the hash 
table. In other words, given sufficient parallelism we can 



handle a given data rate even if there is a huge number 
of entries in a given outer product. 

Lemma 1. Let ha, hi, : £ — > [k] be pairwise independent 
hash functions. Given input vectors a and b with 0{t) 
nonzero entries each and an interval C = [q, g-f 1, . . . , r) 
for < q < r < K we can construct E^'' , the set of 
entries occurring in the outer product ab hashing to a 
value in C, in expected time 0{\£^^\ +t). 

Proof. We exploit the special structure of our hash 
functions: h{i,j) — {ha(i) + hi,[j)) mod k for a 2- 
wise independent hash function ha,hi, : N — > [k] and 
i,j € [n]. It is easy to show that this construction 
implies that h : £ ^ [k] is also pairwise independent. 
To find the entries with a nonzero value in ab that hash 
to [q, r) we first sort the indices with a nonzero value 
in each a and b according to the hash value in arrays 
Ha and Hb for a and b, respectively. Entries with a 
hash value in the right range correspond to elements in 
Ha and Hi, with sum in [q,r) U [k + q, k + r). Since 
the hash values are pairwise independent we can sort 
by bucket sort in expected linear time. One way to find 
the entries in the right range would be to iterate over 
elements of Ha, and for each do two binary searches 
in Hi, to find the values in the right ranges. However, 
this can be further improved by processing Ha in sorted 
order, and exploiting that the entries with hash value in 
the right range correspond to intervals of Hh that will 
be moving monotonically left. Thus, for a fixed index 
i we find the entries j S [n] with a hash value in 

[q, r)U [K + q, K + r) in time linear in the number of such 
entries. This brings down the time to C'df^''! -I- t). 

Parameters. We will assume that data is not 
lightly skewed and z > 1/2. We will distinguish between 
the cases when 1/2 < z < 1 and z > 1. In order to 
keep the presentation concise the particular case z = 1 
will not be analyzed. In our analysis we will also use 
as a parameter the total number of nonzero entries d 
occurring in the matrix product. The value of d is 
not known in advance. One can either be conservative 
and assume d = 0{n^), or in the case of positive 
input matrices use efficient methods for estimating the 
number of nonzero entries [2] if two passes over the input 
are allowed. 

6 Analysis 

Load balancing. First we show that pairwise in- 
dependent hashing of entries guarantees good load bal- 
ance among processors. We show that for outer prod- 
ucts for which the number of non-zero entries is con- 
siderably bigger than the number of processors, CRoP 
achieves good scalability with high probability. 



function CRoP 

Input: matrices A, B £ R+"^", Interval [q,r), a list k Space- 
Saving summaries SS each for £ entries, array for ft real 
numbers CS, pairwise independent hash functions ha, hi, : 
N^> [k], s:£^ {-1,1} 

1: for i S [n] do 

2: a := ith column of A, b := ith row of B 

3: E := }lASH{ha,ht,,a,b,[q,r)) 

4: for e £ E do 

5: Update Space-Saving summary SS[h{e)] with 

the entry e. 

6: CS[h{e)]=CS[h(e)] + e-s{e). 

function HASH 

Input: pairwise independent hash functions ha,hi,:N—> [k], 
vectors a e M", 6 e M", Interval [q, r) 



1: 

2 
3 
4 
5 
6 
7: 
8 
9 

10; 
11 

12 

13 
14 



sort the nonzero entries in a and b according to ha and /15 in 

arrays Ha and Hi,, respectively 

k = Hb.length, E = d 

for i = 1 to Ha -length — 1 do 

while Ha [i] + Hi, [k] > q do 
if A: > 0: k = k-1 

j = k 

while Hall] + Ht[j] < q do 
j = j + i 

if j >= Hi,.length: break for-loop 
while Ha[i] + Hi,[j] < r do 
E = EU{i,j) 
if j < Hi,.length: j = j + 1 
else: break for-loop 
return the set E 



Fi gurG li [CRoP algorithm] A high-level description for a 
single run of our algorithm. The description of the hash functions 
h : £ [k] and s : £ — > { — 1,1} is sent to all processors before 
the computation starts. For a detailed pseudocode description 
of Space-Saving and Count-Sketch the reader is referred to the 
original works [S] I18| . After processing the input matrices one can 
obtain an estimate for individual entries from the corresponding 
Space-Saving summary or Count-Sketch estimator. 



Theorem 1. Suppose we run K instances of CRoP 
with disjoint intervals of the same size, on an input of 
column and row vectors a, 6 G M". Let W = . Then 

for W > ^"^^^''^ the probability that the number of entries 
processed by a given instance will deviate by more than 
XW from its expected value is bounded by ^i^^y^i^^^^ for 
any A > 0. 

Proof. We distribute the nonzero entries in ab to K 
processors. We consider one of the K processors, say 
Pi. Let Ye be an indicator random variable denoting 
whether the entry e is hashed to the range of Pi. 
Clearly, E[Y,] - 1/K. Let X - Eeeab^e- Thus, we 
have E[X] = W . Since the hash function is pairwise 
independent for the variance of X wc have V[X] = 



^ E[X] = W. Using Chebyshev's inequality 
and W > i^JpM estimate the probability that this 
number deviates by more than A to: 



Ft[\X-E[X]\ > XW] < 



K 



V[X]K^ 

(AHH?~ A^ 



< 



J_ 
e^ab K 



The above theorem essentially says that when we 
expect a sufficient number of entries per core in a given 
outer product, the probability that the work load will 
deviate by more than a small constant factor from its 
expectation, namely W/K, is very small. Thus, we 
expect that the computation of only a small fraction 
of the outer products will not be well balanced among 
processors and this will not considerably affect the total 
performance. 

In the following we present results for the estimates 
based on Space-Saving (giving guarantees on the 
upper/lower bounds), as well as the unbiased Count- 
Sketch estimator returned by our algorithm. 

Quality of estimates. 

Theorem 2. Let A,B be n x n nonnegative matrices 
such that there are d nonzero entries in AB adhering to 
a Zipfian distribution with parameters C and z. After 
processing all outer products Oibi, 1 < i < n, by CRoP 
with k buckets and a Space-Saving data structure with i 
entries 

• if z > \, an entry with weight at least is 
recorded in one of the Space-Saving summaries with 
probability more than 1/2. By running O(log^) 
instances o/CRoP in parallel, we report all entries 
with weight ^{jj^) or more with probability at 
least 1 — ^. 

• if z < 1, an entry with weight at least 
i7(max(p^, ^^^j — )) is recorded in one of the 
Space-Saving summaries with probability more than 
1/2. By running O(log^) instances of CRoP 
in parallel, for K = max(£fc, jj^), we report all 
entries with weight r2(max(p^, — )) or more 
with probability at least 1 — S. 

Proof. Let the minimum absolute weight for heavy 
entries be am, a > 0. (At the end of the proof we 
will obtain bounds on am depending on k.) We will 
estimate the probability that a heavy entry e is not 
reported. From the Zipfian distribution we obtain that 
X :— (-^)^ entries will have weight above am. Let 
B be the bucket e hashes to. We first consider the 



case that at most €/2 entries with weight above am 
are hashed to B. The expected number of heavy entries 
hashed to B is x/k. For k > ^ we get by Markov's 
inequahty that the probabihty pi that more than £/2 
heavy entries wiU land in B is at most 1/6. As aheady 
discussed in Section |4j if the total weight of non-heavy 
entries hashed to B is less than iam/2 and no other 
heavy entries hash to B, then all heavy entries will be 
reported. Let w := J2'i=x+i be the total weight of 
non-heavy entries. In the following we will use the fact 
that w = 0(C(di-^)) for z < 1 and w = 0{C{x^-')) 
for z > 1, where d is the number of distinct entries 
in the matrix product. This follows by integration of 
the corresponding continuous function. Then we expect 
non-heavy entries of total weight w/k to land in B. 

Let Yj be an indicator random variable denoting 
whether the jth non-heavy entry is hashed to B and 
^ = Ei=i^i- Clearly, E[X] = w/k. Applying 
Markov's inequality we obtain p2 '■— Pr[X > 3w/k] < 
1/3. We want 3w/k < lam/2. For z < 1 we 
have w = 0{C{d}~^) thus together with the bound 
on the number of heavy entries in the bucket we set 
k = max(0(i(£^)^),0(^^). Similarly, for z > 
1 we have w = 0{C{x^^^), hence the bound k = 
0(j(^)5). Thus, for z < 1, we will consider entries 
heavy if their weight is am — fl{max.{ , ^^^fj. — )) and 
for z > 1 if am = ^{jj^). By the union bound the 
probability that an entry with weight at least am is not 
reported is at most pi+p2 < 1/2. 

By running t copies of the algorithm in parallel 
and reporting only entries that are reported in some 
summary in at least t/2 cases, we can amplify the 
probability for a correct estimate to exp(0(t)). The 
analysis follows a standard application of Chernoff's 
inequality. Thus, for the given values of t and K 
for each heavy entry we bound the probability of not 
being reported to ^ and for z > 1 and z < 1, 
respectively. Since the number of heavy entries in each 
case is bounded by £k or K, by the union bound we 
bound the error to 6 that any heavy entry will not be 
reported. 

Count-Sketch was analyzed for Zipfian data in the 
original publication [ff^ . 

Theorem 3. fS^ Let A,B e M"^" such that the entries 
in AB adhere to a Zipfian distribution with parameters 
C and z > 1/2. After processing all outer products 
akbk by CRoP with k Count-Sketch estimators, each 
entry is approximated with additive error of at most 
ri(|^) with probability more than 1/2. By running of 
0{j) instances in parallel the additive error of at most 
r2(^) for any entry holds with probability at least 1 — S. 



A lower bound. We present a lower bound for the 
space needed to report the heaviest entry in a matrix 
product by the column-row method. Of course, this 
applies also to any harder problem, such as reporting a 
larger set of heavy entries, with estimates. 

Theorem 4. Any algorithm that always outputs the 
heaviest entry in a matrix product AB, or even just the 
weight of the heaviest entry, in only one pass over the 
outer products aibi, i G [n], must encode in its state 
all entry weights, in the sense that if two prefixes of 
the outer products differ in the weight of some entry, 
the algorithm must be in different states after seeing the 
prefixes. 

Proof. Since we are allowed only one pass over the 
column and row vectors of the input matrices, we will 
consider the outer products as a stream of updates 
for the output entries. For a prefix of the stream 
consider the count vector that for each entry records 
its weight in the prefix. Let A be any algorithm that 
computes the heaviest entry in a data stream. Consider 
two distinct weight vectors x and y, corresponding to 
diff'erent stream prefixes. We argue that A must be in 
two different states after seeing these prefixes. Suppose 
that the latter claim is not true, so that for x ^ y the 
algorithm is in the same state. Since x and y differ, there 
must be at least one entry (i, j) having distinct weights 
in the two weight vectors. But this implies that we 
can extend the streams with a sequence that makes the 
entry the heaviest one in one of the weight vectors, 
say w.l.o.g. X, becomes the heaviest entry, while 

this does not happen in y. Still, the algorithm would 
be in the same state in both cases, returning the same 
result and weight. This contradicts the assumption that 
A always returns the correct answer, so the assumption 
that X and y resulted in the same state must be false. 

Intuitively, the only generally applicable ways of 
storing the weights of all entries is to either store each 
weight explicitly, or store all column and row vectors 
seen so far. 

7 Frequent pair mining in transactional data 
streams: a case study 

Our original motivation was to parallelize the mining 
of frequent pairs from a high speed transaction stream. 
As already explained, the problem can be seen as an 
instance of sparse matrix multiplication AA^ by the 
column row method. 

We use the following notation for transactional data 
streams. Let I be a set of items. A transaction T d X 
is a set of items. We call a subset p = {i,j} C I a 
pair. For a stream of transactions S, the support of 



a pair p is the number of transactions containing p: 
sup(p) = |{r : p C T}\,T e S. We can consider each 
transaction T as a {0, 1}- valued sparse n-dimensional 
vector vt and the set of pairs occurring in the given 
transaction as the nonzero entries above the diagonal in 
the outer product of vt and its transpose. 

7.1 Frequent Pattern Mining. Considerable work 
has been done on parallel and distributed implementa- 
tions of frequent itemset methods. We refer to Zaki's 
1999 survey for an overview of some main tech- 
niques that have been investigated. Subsequent work 
has focused on either multi-core computing, aiming to 
minimize the overhead of access to shared data struc- 
tures [131 HI] , or more recently GPU computing, focus- 
ing on representations of data that allow efficient GPU 
implementations (see e.g. [161 [15] for recent results). 

To our best knowledge, all known methods for par- 
allelising frequent itemset mining use either a shared 
data structure, or a "vertical" , column-by-column lay- 
out of the data (for each item, a sorted list of the trans- 
actions where it occurs). In the former case the shared 
data structure will become a bottleneck when scaling to 
many cores. In the latter case one can no longer have 
a space-efficient streaming algorithm, because comput- 
ing the vertical representation requires storing all trans- 
actions. It is also well-known that for sparse matri- 
ces this representation, which supports fast computa- 
tion of inner products, leads to higher time complexity 
than methods based on outer products. See [3j for an 
overview of theoretical results on (serial) sparse matrix 
products. 

7.2 Related vi^ork on stream mining. Though we 
view the new load balancing technique as the main 
contribution of our work, its application to stream 
mining is interesting in its own right. In the following 
we briefly review related work in this area. 

Manku and Motwani p/Tj first recognized the neces- 
sity for efficient algorithms targeted at frequent item- 
sets in transaction streams and presented a heuristic ap- 
proach generalizing their StickySampling algorithm. 
A straightforward approach to mining of frequent pairs 
is to reduce the problem to that of mining frequent items 
by generating all item pairs in a given transaction. Yu et 
al. [22] and Campagna and Pagh [5] present randomized 
algorithms for transaction stream mining. The theoret- 
ical bounds on the quality of their estimates however 
heavily depend on the assumption that transactions are 
either generated independently at random by some pro- 
cess or arrive in a random order. It is already clear from 
the experiments of 6 that such optimistic assumptions 
do not hold for many data sets. For both schemes [22J|S] 



Dataset 


# of pairs (F2) 


# of distinct pairs 


Mushroom 


22.4 ■ 10^ 


3.65 ■ 10^ 


Pumsb 


1360 ■ 10^ 


536 ■ 10^ 


Pumsb_star 


638 ■ 10^ 


485 ■ 10^ 


Kosarak 


3130 ■ 10^ 


33100 ■ 10^ 


Retail 


80.7 ■ IQS 


3600 ■ 10^ 


Accidents 


187 ■ 10^ 


47.3 ■ 10^ 


Webdocs 


2.0 ■ 10" 


> 7-1010 


Nytimcs 


1.0 • 10"' 


> 5 ■ 10* 


Pubmed 


1.6 • 10"' 


> 6 ■ 10« 


Wikipcdia 


5.17- 10" 


> 5.8 • 10« 



Table 1: Information on data sets for our experiments. Nytimes 
and Pubmed are taken from the UCI Machine Learning Repos- 
itory (Bag of Words data set). The wikipedia dataset has been 
crafted according to what is described in [l] Page 14]. The other 
data sets are from the Frequent Itemset Mining Implementations 
(FIMI) Repository. For the last three data sets the number of 
distinct pairs was estimated using a hashing technique from [2]. 
In the context of sparse matrix multiplication the number of pairs 
is the total number of nonzero entries in all outer products and 
the number of distinct pairs is the number nonzero entries in the 
output matrix. 

it is easy to find an ordering of essentially any trans- 
action stream that breaks the randomness assumption, 
and makes it perform much worse than the theoretical 
bounds. We therefore believe that a more conservative 
model is needed to derive a rigorous theoretical analysis, 
while exploiting observed properties of real data sets. 

7.3 Experimental evaluation. We worked with 
a cache-optimized Java implementation working only 
with primitive data types and used the built-in random 
number generator to store hash values in a table. 
Unless otherwise reported we ran experiments on 
as Mac Pro desktop equipped with Quad-Core Intel 
2.8GHz and 16 GB main memory. In this architecture 
there are 8 cores and a total of 24 MB cache available 
but 2 cores share 6 MB of cache. 

Datasets. We evaluated the performance of our al- 
gorithm on the following datasets: Mushroom, Pumsb, 
Pumsb_star and Kosarak, taken from the Frequent 
Itemset Mining Implementations (FIMI) Repository, 
Wikipedia - crafted according to what is described in pQ 
Page 14] , and Nytimes and Pubmed taken from the UCI 
Machine Learning Repository. 

Table [T] summarizes the data sets used in experi- 
ments. In all cases, we use the order in which the trans- 
actions are given as the stream order. 

Accuracy of results. Our first set of experiments 
shows results on the precision of the counts obtained 
by CRoP using a Space-Saving data structure of size 
2, i.e., we record only two pairs. The accuracy depends 
on the amount of space used and the number of pairs we 



are interested in reporting. Assume the pairs are sorted 
in decreasing order according to their frequency, we say 
that ith pair has rank i. We made one experiment 
fixing the space usage, and looking at results for pairs 
of decreasing rank (computed exactly), and one that 
varies the space usage and considers the top-100 pairs. 
In practice, it may be hard to foresee how much space 
will be needed for a particular stream, so probably one 
will tend to use as much space as feasible with respect 
to running time (ensure in-cache hash table), or what 
amount of memory can be made available on the system. 
A consequence of this will be even more precise results. 
The results of our experiments on the Nytimes data set 
can be seen in Figure [2ja). 

Varying space usage We now investigate what 
happens to the quality of results when the space usage 
of CRoP is pushed to, and beyond, its limits. For this, 
we chose to work with 3 representative data sets, namely 
Mushroom, Retail and Accidents, for decreasing space 
usage, plotting the ratio between the upper and lower 
bounds for the top-100 pairs returned by our algorithm. 
This is shown in Figure ^h) and we can see how the 
transition between very poor and very good quality is 
fairly fast. 

Larger Space-Saving data structures. We es- 
timated the number of pairs with ratio at least 90% 
between lower and upper bound by varying the number 
of pairs in each bucket but keeping the total number 
of pairs of our sketch constant. We counted the num- 
ber of pairs until the 100-th bad estimate, i.e. until 
100 pairs have a lower bound less than 90% of its up- 
per bound. Since our algorithm is randomized we chose 
100 as cut-off in order not to be sensitive to outliers in 
the estimates. We fixed the total number of pairs for 
the Kosarak dataset to 240000 and for the Accidents 
dataset to 6000. We then varied the pairs per bucket 
from 2 to 24. The effect was much better estimates with 
the same space usage, see Figure [3(a) 

Count-Sketch Estimates The result of the un- 
biased COUNT-SKETCH estimator for the Kosarak 
dataset with 50000 buckets and 2 pairs per bucket is pre- 
sented in Figure [3(b)| We ran the algorithm 11 times 
and for each pair reported at least 6 times we return 
the median of its estimates. The plot shows the ra- 
tio of our estimates and the exact count of the 3000 
pairs with highest support computed by Apriori. Not 
reported pairs have ratio 0. In general we observe that 
the estimates given by Space-Saving are tighter. 

Scalability. In order to assess the scalability of our 
approach, we performed the following experiment: We 
ran the algorithm with a Space-Saving data structure 
on the Kosarak dataset with a sketch of size lOOOOi on i 
cores. For each run we computed the recall for the top 
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Figure 2: (a) The plot on the left side shows upper and lower 
bounds for Nytimes computed by CRoP using 10® buckets. Values 
are normalized by dividing by true support. Upper bounds 
shadow lower bounds, exact bounds are visible only as a red 
dot with no blue dot below. As can be seen, upper bounds are 
generally tighter than lower bounds, (b) On the right-hand side 
we plot the average ratio of lower and upper bound for top-100 
pairs, for three representative data sets, as function of number of 
buckets. As can be seen, there is a quick transition from poor to 
excellent precision. 

10000 pairs in the dataset, see Table [2] The idea is that 
each core can handle a sketch of size 10000 using only its 
local cache. The running time is the time needed for all 
processes to complete. Given that each core needs about 
8 sec to read all input transactions, sort it according 
to the hash values of the items and then decide which 
pairs it is responsible for, a very good scalability is 
observed. The results indicate that using parallelism 
our approach is both more efficient and achieves more 
accurate results. 

Load balance. We have run experiments evaluat- 
ing the distribution of pairs among the buckets. When 
running these experiments, we kept track of the num- 
ber of pairs processed by each core. The results of these 
experiments are reported in Table [3j The numbers in 
the table confirm that the pairs spread evenly amongst 



kosarak 



Cores 


1 


2 


4 


8 


Time(sec) 


22.13 


14.87 


11.73 


9.96 


Recall 


0.1119 


0.1988 


0.3156 


0.4812 



Table 2: Recall for Kosarak. 
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Figure 3: On the left side is the number of estimates with ratio 
of at least 90% until 100 bad estimates have been seen for a given 
number of pairs in each Space-Saving summary. As we see a sharp 
improvement in the quality of the estimates can be observed by 
increasing the number of pairs in each summary. On the right- 
hand side the plot represents the ratio of estimates and true count 
for the top 3000 pairs of Kosarak. All top 1400 pairs are reported 
by our algorithm and for most of the pairs the estimates are within 
of factor 2. However, the estimates are worse than those given by 
Space-Saving with the same total space usage. 



the buckets meaning that parallelism greatly improves 
the running time of the algorithm, since there will be 
no core that has to sustain a much larger burden than 
the others; such a negative situation would bring the 
performances of the algorithm close to a sequential one. 

Progress of processing. We ran four processes, 
and let the operating system allocate one to each core. 
We tracked the progress of each process over time. The 
result for the Kosarak dataset are shown in Figure [4(b) | 



Dataset 


Cores 


Average 


Maximum 




8 


895542 


934040 


Retail 


4 


1791084 


1808784 




2 


3582168 


3585869 




8 


3203546 


3219769 


Kosarak 


4 


6407091 


6427398 




2 


1.2814 ■ lO'^ 


1.2827 ■ lO'^ 




8 


8.928 ■ 10* 


8.9394 ■ 10* 


Webdocs 


4 


1.7856 ■ 10^ 


1.7871 ■ 10^ 




2 


3.5712 ■ 10^ 


3.573 ■ 10^ 




8 


1.2497 ■ 10^ 


1.251 ■ 10^ 


Nytimes 


4 


2.4995 ■ 10^ 


2.5005 ■ 10^ 




2 


4.9989 ■ 10^ 


4.9995 ■ 10^ 


Wikipedia 


8 


6.4582 ■ 10^" 


6.4651 ■ lO^o 



and for Nytimes dataset in Figure 4(a) As can be seen, 
the cores make almost identical progress when running 
at full speed. In a data streaming setting this means 
that we can expect to manage streams that require all 



Table 3: The table shows the average and maximum number 
of pair occurrences handled by each core. As can be seen, the 
maximum is quite close to the average. 



cores to run at close to maximum speed, while needing 
to cache only a small number of transactions. 

Somewhat surprisingly and inconsistent with the 
above results, we observed however that our experi- 
ments on a multicore CPU for various architectures 
do not always suggest very good scalability with in- 
creased number of cores. For example, the Webdocs 
dataset consists of long transactions only, thus the time 
for reading the input should not dominate. However, 
we observed that even if the processing time becomes 
better with more cores the advantages of having more 
cores become less and less pronounced. In particular, 
the running time between processing the data set with 
eight cores instead of four cores decreases by a factor of 
only 1.25 while the ratio of the running time between 
one core and two cores is about 1.85. The reason is that 
each core has a small amount of dedicated L2 cache 
which is not sufficient to keep the whole interval of the 
hash table assigned to it and this leads to memory con- 
tention among processes. This means the potential of 
our method can be fully exploited when distributing the 
work to several processors or when we work with rather 
small sketches of the data stream. 

More precisely, experiments have been carried out 
in order to verify how the algorithm scales, in terms of 
time, when parallel computations are used. We ran the 
algorithm on various datasets using different number 
of cores in order to highlight the parallel nature of 
the algorithm. Table [4] reports some of the results we 
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Figure 4: On the left side we see the progress of the current 
transa<;tion number of each core running CRoP on the Nytimes 
data set, over a period of about 400 seconds and on the right-hand 
side the total time recorded for every 100th transaction when 
running CRoP with 4 cores on the Kosarak data set. 



obtained. 

As one can sec the exact running times for the eval- 
uation of the scalability of our algorithm appear some- 
what cumbersome since the improvement in running 
times does not suggest very good scalability. This is not 
due to a flaw in our implementation but a consequence 
of the specific Xeon E5570 architecture. Each core has 
a small amount of dedicated cache, namely 256 KB. In 
order to obtain correct estimates we need a large num- 
ber of buckets which do not fit in cache and this leads 
to memory contention. For considerably smaller size 
of our sketch we achieve almost perfect scalability. We 
ran another set of experiments on a Mac Pro desktop 
equipped with Quad-Core Intel 2.8GIIz. In this archi- 
tecture there are 8 cores and a total of 24 MB cache 
available but 2 cores share 6 MB of cache. The running 
time for 4 parallel processes was about two times better 
than the running time for 2 parallel processes but the 
improvement for 8 processes was not as good, clearly in- 



Dataset 


# of cores 


ms on # cores 


ms 1 core 


Retail 


8 


1321 


2001 


4 


1193 


2 


1512 


Kosarak 


8 


1551 


2881 


4 


1586 


2 


1997 


Webdocs 


8 


299153 


891565 


4 


357679 


2 


482111 


Nytimes 


8 


443119 


1313698 


4 


524553 


2 


689058 


Wikipedia 


8 


27526403 


93477243 


4 


35397110 


2 


53795313 



Table 4: Experiments ran on an Intel Xeon E5570 
2.93 GHz equipped with 23 GB of RAM; the OS is 
GNU/Linux, kernel version 2.6.18. The number of 
processes used is 8 for all four datasets. Times are 
given in milliseconds (ms). The number of buckets 
for the largest datasets is of the order 2^". Wc can 
observe that in any case, millions of pairs per second 
were manipulated by the algorithm. 



dicating cache contention among 8 cores and an optimal 
use for 4 processes. 

For a large dataset with a high number of pairs such 
as Nytimes our simple Java implementation nevertheless 
processed almost 100 million pairs per second on an 
8-core Mac Pro, the total running time was about 
110 seconds. The sketch contained 400000 buckets 
which was enough to find the exact counts of the top 
few hundred pairs. The throughput of a competing 
hash table solution can be upper bounded by the 
number of updates of random memory locations possible 
(disregarding time for hash fimction computation, and 
other overheads). On the Mac Pro the number of such 
updates per second was estimated to around 50 millon 
per second, when updating a 1 GB table using 8 cores. 
This means that we are at least a factor of 2 faster 
than any implementation based on a large, shared data 
structure. 
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